22 research outputs found

    Inducing the Cross-Disciplinary Usage of Morphological Language Data Through Semantic Modelling

    Get PDF
    Despite the enormous technological advancements in the area of data creation and management the vast majority of language data still exists as digital single-use artefacts that are inaccessible for further research efforts. At the same time the advent of digitisation in science increased the possibilities for knowledge acquisition through the computational application of linguistic information for various disciplines. The purpose of this thesis, therefore, is to create the preconditions that enable the cross-disciplinary usage of morphological language data as a sub-area of linguistic data in order to induce a shared reusability for every research area that relies on such data. This involves the provision of morphological data on the Web under an open license and needs to take the prevalent diversity of data compilation into account. Various representation standards emerged across single disciplines which lead to heterogeneous data that differs with regard to complexity, scope and data formats. This situation requires a unifying foundation enabling direct reusability. As a solution to fill the gap of missing open data and to overcome the presence of isolated datasets a semantic data modelling approach is applied. Being rooted in the Linked Open Data (LOD) paradigm it pursues the creation of data as uniquely identifiable resources that are realised as URIs, accessible on the Web, available under an open license, interlinked with other resources, and adhere to Linked Data representation standards such as the RDF format. Each resource then contributes to the LOD cloud in which they are all interconnected. This unification results from ontologically shared bases that formally define the classification of resources and their relation to other resources in a semantically interoperable manner. Subsequently, the possibility of creating semantically structured data has sparked the formation of the Linguistic Linked Open Data (LLOD) research community and LOD sub-cloud containing primarily language resources. Over the last decade, ontologies emerged mainly for the domain of lexical language data which lead to a significant increase in Linked Data-based linguistic datasets. However, an equivalent model for morphological data is still missing, leading to a lack of this type of language data within the LLOD cloud. This thesis presents six publications that are concerned with the peculiarities of morphological data and the exploration of their semantic representation as an enabler of cross-disciplinary reuse. The Multilingual Morpheme Ontology (MMoOn Core) as well as an architectural framework for morphemic dataset creation as RDF resources are proposed as the first comprehensive domain representation model adhering to the LOD paradigm. It will be shown that MMoOn Core permits the joint representation of heterogeneous data sources such as interlinear glossed texts, inflection tables, the outputs of morphological analysers, lists of morphemic glosses or word-formation rules which are all equally labelled as “morphological data” across different research areas. Evidence for the applicability and adequacy of the semantic modelling entailed by the MMoOn Core ontology is provided by two datasets that were transformed from tabular data into RDF: the Hebrew Morpheme Inventory and Xhosa RDF dataset. Both further demonstrate how their integration into the LLOD cloud - by interlinking them with external language resources - yields insights that could not be obtained from the initial source data. Altogether the research conducted in this thesis establishes the foundation for an interoperable data exchange and the enrichment of morphological language data. It strives to achieve the broader goal of advancing language data-driven research by overcoming data barriers and discipline boundaries

    Translation-Based Dictionary Alignment for Under-Resourced Bantu Languages

    Get PDF
    Despite a large number of active speakers, most Bantu languages can be considered as under- or less-resourced languages. This includes especially the current situation of lexicographical data, which is highly unsatisfactory concerning the size, quality and consistency in format and provided information. Unfortunately, this does not only hold for the amount and quality of data for monolingual dictionaries, but also for their lack of interconnection to form a network of dictionaries. Current endeavours to promote the use of Bantu languages in primary and secondary education in countries like South Africa show the urgent need for high-quality digital dictionaries. This contribution describes a prototypical implementation for aligning Xhosa, Zimbabwean Ndebele and Kalanga language dictionaries based on their English translations using simple string matching techniques and via WordNet URIs. The RDF-based representation of the data using the Bantu Language Model (BLM) and - partial - references to the established WordNet dataset supported this process significantly

    On the linguistic linked open data infrastructure

    Get PDF
    In this paper we describe the current state of development of the Linguistic Linked Open Data (LLOD) infrastructure, an LOD(sub-)cloud of linguistic resources, which covers various linguistic data bases, lexicons, corpora, terminology and metadata repositories.We give in some details an overview of the contributions made by the European H2020 projects “PrĂȘt-Ă -LLOD” (‘Ready-to-useMultilingual Linked Language Data for Knowledge Services across Sectors’) and “ELEXIS” (‘European Lexicographic Infrastructure’) to the further development of the LLOD

    Challenges for the representation of morphology in ontology lexicons

    Get PDF
    Recent years have experienced a growing trend in the publication of language resources as Linguistic Linked Data (LLD) to enhance their discovery, reuse and the interoperability of tools that consume language data. To this aim, the OntoLex-lemon model has emerged as a de facto standard to represent lexical data on the Web. However, traditional dictionaries contain a considerable amount of morphological information which is not straightforwardly representable as LLD within the current model. In order to fill this gap a new Morphology Module of OntoLex-lemon is currently being developed. This paper presents the results of this model as on-going work as well as the underlying challenges that emerged during the module development. Based on the MMoOn Core ontology, it aims to account for a wide range of morphological information, ranging from endings to derive whole paradigms to the decomposition and generation of lexical entries which is in compliance to other OntoLex-lemon modules and facilitates the encoding of complex morphological data in ontology lexicons

    The Open Linguistics Working Group: developing the Linguistic Linked Open Data cloud

    Get PDF
    The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Linked Open Data (LLOD) cloud, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminologies, and metadata repositories. We present and summarize five years of progress on the development of the cloud and of advancements in open data in linguistics, and we describe recent community activities. The paper aims to serve as a guideline to orient and involve researchers with the community and/or Linguistic Linked Open Data

    MMoOn Core – the Multilingual Morpheme Ontology

    Get PDF
    In the last years a rapid emergence of lexical resources has evolved in the Semantic Web. Whereas most of the linguistic information is already machine-readable, we found that morphological information is mostly absent or only contained in semi-structured strings. An integration of morphemic data has not yet been undertaken due to the lack of existing domain-specific ontologies and explicit morphemic data. In this paper, we present the Multilingual Morpheme Ontology called MMoOn Core which can be regarded as the first comprehensive ontology for the linguistic domain of morphological language data. It will be described how crucial concepts like morphs, morphemes, word forms and meanings are represented and interrelated and how language-specific morpheme inventories can be created as a new possibility of morphological datasets. The aim of the MMoOn Core ontology is to serve as a shared semantic model for linguists and NLP researchers alike to enable the creation, conversion, exchange, reuse and enrichment of morphological language data across different data-dependent language sciences. Therefore, various use cases are illustrated to draw attention to the cross-disciplinary potential which can be realized with the MMoOn Core ontology in the context of the existing Linguistic Linked Data research landscape

    Creating Linked Data morphological language resources with MMoOn: the Hebrew Morpheme Inventory

    Get PDF
    The development of standard models for describing general lexical resources has led to the emergence of numerous lexical datasets of various languages in the Semantic Web. However, there are no models that describe the domain of Morphology in a similar manner. As a result, there are hardly any language resources of morphemic data available in RDF to date. This paper presents the creation of the Hebrew Morpheme Inventory from a manually compiled tabular dataset comprising around 52.000 entries. It is an ongoing effort of representing the lexemes, word-forms and morphologigal patterns together with their underlying relations based on the newly created Multilingual Morpheme Ontology (MMoOn). It will be shown how segmented Hebrew language data can be granularly described in a Linked Data format, thus, serving as an exemplary case for creating morpheme inventories of any inflectional language with MMoOn. The resulting dataset is described a) according to the structure of the underlying data format, b) with respect to the Hebrew language characteristic of building word-forms directly from roots, c) by exemplifying how inflectional information is realized and d) with regard to its enrichment with external links to sense resources

    OnLiT: an ontology for linguistic terminology

    No full text

    Front Matter, Table of Contents, Preface, Conference Organization

    No full text
    Front Matter, Table of Contents, Preface, Conference Organizatio
    corecore